Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme, over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed independently for the target dataset under consideration, and for a base model being considered as a starting point. This is in contrast to current perception that the alignment between the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We analyze different aspects that contribute to each. Furthermore, we leverage our analysis to propose a practical and efficient approach to determine if and how to select a base model in real-world settings. Last, we release an updating ranking of best models in the HuggingFace hub per architecture https://ibm.github.io/model-recycling/.
translated by 谷歌翻译
Pretraining has been shown to scale well with compute, data size and data diversity. Multitask learning trains on a mixture of supervised datasets and produces improved performance compared to self-supervised pretraining. Until now, massively multitask learning required simultaneous access to all datasets in the mixture and heavy compute resources that are only available to well-resourced teams. In this paper, we propose ColD Fusion, a method that provides the benefits of multitask learning but leverages distributed computation and requires limited communication and no sharing of data. Consequentially, ColD Fusion can create a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are based on. We show that ColD Fusion yields comparable benefits to multitask pretraining by producing a model that (a) attains strong performance on all of the datasets it was multitask trained on and (b) is a better starting point for finetuning on unseen datasets. We find ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 diverse datasets, ColD Fusion-based model outperforms RoBERTa by 2.45 points in average without any changes to the architecture.
translated by 谷歌翻译
近年来,具有两个较高架构的视觉语言(VL)模型主导了视觉表示的学习。当前的VL模型要么使用轻型Uni-Modal编码器,并在交叉模式编码器中同时提取,对齐和融合这两种模态,或者将最后一层的Uni-Modal-Modal特征直接馈入顶部的交叉模式编码器,而忽略了语义深度单模式编码器中不同级别的信息。两种方法都可能限制视觉表示学习和限制模型性能。在本文中,我们介绍了多个桥梁层,该层在Uni-Modal编码器的顶层和跨模式编码器的每一层之间建立了连接。这可以在不同语义级别的视觉和文本表示之间进行全面的自下而上相互作用,从而导致更有效的跨模式对齐和融合。我们提出的桥梁可以预先训练,仅需$ 4 $ m的图像,可以在各种下游视觉语言任务上实现最先进的性能。在VQAV2 Test-STD集合中,Bridge-Tower的准确性为$ 78.73 \%$,与以前的最先进的仪表型号相同的the Art仪表均优于先前的最先进的仪表\%$ $,并且几乎没有其他参数,并且几乎没有其他参数和其他参数计算成本。值得注意的是,当进一步扩展模型时,桥梁可以达到81.15美元\%$的准确性,超过了在较大的数据集中预先训练的模型。代码可在https://github.com/microsoft/bridgetower上找到。
translated by 谷歌翻译
We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-art. From a theoretical perspective, PreQuEL places the focus on the source text, tracing properties, possibly linguistic features, that make a sentence harder to machine translate. We develop a baseline model for the task and analyze its performance. We also develop a data augmentation method (from parallel corpora), that improves results substantially. We show that this augmentation method can improve the performance of the Quality-Estimation task as well. We investigate the properties of the input text that our model is sensitive to, by testing it on challenge sets and different languages. We conclude that it is aware of syntactic and semantic distinctions, and correlates and even over-emphasizes the importance of standard NLP features.
translated by 谷歌翻译
用功能近似的增强学习最近在具有较大状态空间的应用中取得了巨大的结果。这一经验成功促使人们越来越多的理论工作提出了必要和充分的条件,在这些条件下,有效的强化学习是可能的。从这项工作中,已经出现了一种非常简单的最小条件,以进行样品有效的增强学习:具有最佳价值函数的MDP $ v^*$和$ q^*$线性在某些已知的低维功能中。在这种情况下,最近的作品设计了样品有效算法,这些算法需要在特征维度中多个样本,并且独立于状态空间的大小。但是,他们将发现计算高效的算法作为未来的工作,这被认为是社区中的主要开放问题。在这项工作中,我们通过呈现线性函数近似的RL的第一个计算下限来取得进展:除非NP = RP,否则对于确定性的过渡MDP,不存在任何随机多项式时间算法,具有恒定的动作和线性最佳值功能。为了证明这一点,我们显示了唯一SAT的减少,在该SAT中,我们将CNF公式转换为具有确定性转换,恒定动作数量和低维线性最佳值函数的MDP。该结果还表现出具有线性函数近似的增强学习中的第一个计算统计差距,因为潜在的统计问题在理论上是可以通过多项式查询的信息来解决的,但是除非NP = rp,否则不存在任何计算有效算法。最后,我们还证明了在随机指数时间假设下的准多项式时间下限。
translated by 谷歌翻译
可实现和不可知性的可读性的等价性是学习理论的基本现象。与PAC学习和回归等古典设置范围的变种,近期趋势,如对冲强劲和私人学习,我们仍然缺乏统一理论;等同性的传统证据往往是不同的,并且依赖于强大的模型特异性假设,如统一的收敛和样本压缩。在这项工作中,我们给出了第一个独立的框架,解释了可实现和不可知性的可读性的等价性:三行黑箱减少简化,统一,并在各种各样的环境中扩展了我们的理解。这包括没有已知的学报的模型,例如学习任意分布假设或一般损失,以及许多其他流行的设置,例如强大的学习,部分学习,公平学习和统计查询模型。更一般地,我们认为可实现和不可知的学习的等价性实际上是我们调用属性概括的更广泛现象的特殊情况:可以满足有限的学习算法(例如\噪声公差,隐私,稳定性)的任何理想性质假设类(可能在某些变化中)延伸到任何学习的假设类。
translated by 谷歌翻译